AITopics | brier score

brier score

Calibration without labels in multiple testing

Wadekar, Adway S., Soloff, Jake A.

arXiv.org Machine Learning, Jun-19-2026

Large-scale hypothesis testing supports probability claims about individual hypotheses, as in empirical Bayes methods for estimating local false discovery rates. We study how such claims can be interpreted as approximately calibrated forecasts of the null hypothesis, yielding interpretable error probabilities even under model misspecification. Our approach draws conceptual inspiration from probabilistic forecasting but addresses a different challenge: unlike forecasting, where labels are eventually observed, in multiple testing the ground truth is never revealed, so calibration must be assessed stochastically and established indirectly. We address this challenge by constructing a set of pseudo-labels, derived from the spacings of ordered $p$-values, which have the local false discovery rate as their regression target. Our construction unlocks existing tools for assessing and performing post-hoc calibration in multiple testing. Notably, we find on a large-scale empirical survey of published psychology and neuroscience literature that the $q$-value, a popular error measure based on the false discovery rate, can be severely miscalibrated.

artificial intelligence, lfdr, machine learning, (19 more...)

arXiv: 2606.19737

Country:

North America > United States > Michigan (0.40)
North America > United States > California (0.28)

Genre: Research Report (0.67)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Berta, Eugène, Holzmüller, David, Bach, Francis, Jordan, Michael I.

arXiv.org Machine Learning, May-29-2026

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
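To make the idea of improvement in a proper scoring rule concrete, here is a minimal Python sketch. It assumes the Brier score as the scoring rule and treats the improvement as the score before calibration minus the score after; the abstract does not give the exact PHI definition, so `post_hoc_improvement` is a hypothetical helper, not the benchmark's implementation. The temperature is fixed by hand for illustration, whereas in practice it would be fitted on a held-out calibration set.

```python
import math


def brier_score(probs, labels):
    """Mean squared difference between forecast probability and binary outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def temperature_scale(probs, temperature):
    """Rescale binary probabilities by dividing their logits by `temperature`."""
    scaled = []
    for p in probs:
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep the logit finite
        logit = math.log(p / (1 - p))
        scaled.append(1 / (1 + math.exp(-logit / temperature)))
    return scaled


def post_hoc_improvement(score_fn, raw, calibrated, labels):
    """Hypothetical PHI: score before calibration minus score after.

    Positive values mean calibration lowered (improved) the score.
    """
    return score_fn(raw, labels) - score_fn(calibrated, labels)


# Toy data (made up): an overconfident classifier with one confident mistake.
raw_probs = [0.99, 0.99, 0.99, 0.01, 0.01, 0.99]
labels = [1, 1, 0, 0, 0, 1]

calibrated_probs = temperature_scale(raw_probs, temperature=2.0)
phi = post_hoc_improvement(brier_score, raw_probs, calibrated_probs, labels)
print(f"Brier before: {brier_score(raw_probs, labels):.4f}")
print(f"Brier after:  {brier_score(calibrated_probs, labels):.4f}")
print(f"Improvement:  {phi:.4f}")
```

Because the Brier score is a proper scoring rule, a calibration map that damages the model's discriminative power would show up as a negative improvement, which a pure calibration-error estimate could miss.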

artificial intelligence, calibration, machine learning, (16 more...)

arXiv: 2605.30188

Country: Europe (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.67)
Leisure & Entertainment > Games (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics

Colebank, Mitchel J.

arXiv.org Machine Learning, May-25-2026

Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how more probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable to or better than frequentist methods in SHD classification, and that it provides more robust uncertainty quantification. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof-of-concept for how machine learning can help with triage in getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.

artificial intelligence, bayesian inference, machine learning, (18 more...)

arXiv: 2605.22968

Country: North America > United States (0.46)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Diagnostic Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
(2 more...)

d826f5aadb26db488b8686097ceea2d1-Paper-Conference.pdf

Neural Information Processing Systems, Feb-17-2026, 09:48:49 GMT

artificial intelligence, machine learning, natural language, (20 more...)

Country:

Asia > Singapore (0.04)
Europe > France (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

5f1eee2509599faeeb3570a887016a64-Paper-Conference.pdf

Neural Information Processing Systems, Feb-14-2026, 23:36:25 GMT

large language model, machine learning, pre-training loss, (21 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Asia > Middle East > Jordan (0.04)
(17 more...)

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.46)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

5a5acfd0876c940d81619c1dc60e7748-Paper-Conference.pdf

Neural Information Processing Systems, Feb-14-2026, 06:21:38 GMT

brier score, large language model, machine learning, (21 more...)

Country:

Asia > Middle East > Israel (0.05)
Asia > Middle East > Iran (0.04)
North America > United States > California (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Overview (0.92)

Industry:

Leisure & Entertainment (1.00)
Health & Medicine (1.00)
Government > Voting & Elections (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

455e1e30edf721bd7fa334fffabdcad8-Supplemental-Conference.pdf

Neural Information Processing Systems, Feb-8-2026, 15:54:39 GMT

algorithm, dataset, sequence, (16 more...)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

10fb6cfa4c990d2bad5ddef4f70e8ba2-Supplemental.pdf

Neural Information Processing Systems, Feb-7-2026, 13:14:45 GMT

f-bs-cw, g-bs-cw, stationary point, (14 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Incoherent Beliefs & Inconsistent Actions in Large Language Models

Pal, Arka, Kitanovski, Teo, Liang, Arthur, Potti, Akilesh, Goldblum, Micah

arXiv.org Artificial Intelligence, Dec-5-2025

Real-world tasks and environments exhibit differences from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but can be tricky to determine from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior, and the correct update of their prior. Second, we find that LLMs also often take actions which are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.
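The gap between a directly elicited posterior and the correct update of a prior can be illustrated with a small Python sketch. It assumes a single binary hypothesis and known likelihoods; the function names and all numbers are hypothetical, and the paper's actual elicitation protocol and metric are not given in this abstract.

```python
def bayes_posterior(prior, lik_h, lik_not_h):
    """Posterior P(H | E) from prior P(H) and likelihoods P(E | H), P(E | not H)."""
    evidence = prior * lik_h + (1 - prior) * lik_not_h
    if evidence == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return prior * lik_h / evidence


def update_gap(prior, lik_h, lik_not_h, elicited_posterior):
    """Absolute difference between a stated posterior and the Bayes-consistent one."""
    return abs(elicited_posterior - bayes_posterior(prior, lik_h, lik_not_h))


# Made-up example: a model states prior 0.5, then reports 0.6 after seeing
# evidence that is four times more likely under H than under not-H.
correct = bayes_posterior(prior=0.5, lik_h=0.8, lik_not_h=0.2)
gap = update_gap(prior=0.5, lik_h=0.8, lik_not_h=0.2, elicited_posterior=0.6)
print(f"Bayes-consistent posterior: {correct:.2f}")
print(f"Gap to elicited posterior:  {gap:.2f}")
```

A gap of zero would mean the stated posterior is coherent with the stated prior; it says nothing about whether either belief is accurate.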

large language model, machine learning, natural language, (21 more...)

arXiv: 2511.1324

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.88)

Industry: Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms

Xu, Shuoyan, Zhang, Yu, Miller, Eric J.

arXiv.org Artificial Intelligence, Nov-26-2025

Ride-hailing platforms are characterized by high-frequency, behavior-driven environments, such as shared mobility platforms. Although survival analysis has been widely applied to recurrent events in other domains, its use for modeling ride-hailing driver behavior remains largely unexplored. To the best of our knowledge, this study is the first to formulate driver idle behavior as a recurrent survival process using large-scale platform data. This study proposes a survival analysis framework that uses a Transformer-based temporal encoder with causal masking to capture long-term temporal dependencies and embeds driver-specific embeddings to represent latent individual characteristics, significantly enhancing the personalized prediction of driver retention risk, modeling how historical idle sequences influence the current risk of leaving the platform via trip acceptance or log-off. The model is validated on datasets from the City of Toronto over the period January 2 to March 13, 2020. The results show that the proposed Frailty-Aware Cox Transformer (FACT) delivers the highest time-dependent C-indices and the lowest Brier Scores across early, median, and late follow-up, demonstrating its robustness in capturing evolving risk over a driver's lifecycle. This study enables operators to optimize retention strategies and helps policy makers assess shared mobility's role in equitable and integrated transportation systems. The purpose of this study is to model the driver retention behavior through a transformer-based survival model. Shared mobility services, such as ride-hailing, car-sharing, and bike-sharing, are becoming an increasingly prominent component of contemporary transportation systems. These services are central to the broader concept of Mobility as a Service (MaaS) [1], which aims to integrate various forms of transport into a unified and user-centric platform.

artificial intelligence, machine learning, recurrent event, (19 more...)

arXiv: 2511.19893

Country: North America > Canada (0.15)

Genre: